
Add auto cv pipeline multi buffer enable and preload support #620

Open
zhangstevenunity wants to merge 13 commits into main from codex/cv-preload-scope-marking
Conversation

@zhangstevenunity
Collaborator

No description provided.

zhangstevenunity and others added 12 commits May 2, 2026 20:20
PR #615 added the SyncCodegen scaffolding for multi-buffer set/wait_flag_dyn
ops, but the upstream analysis path was inert: GetEventIdNum hard-coded
return 1, and PTOIRTranslator dropped every pto.pointer_cast addr operand
beyond the first. The N>1 multi-buffer code path was therefore unreachable.

This change makes the PlanMemory -> InsertSync handshake actually work:

- PTOIRTranslator::UpdatePointerCastOpMemInfo now translates *all* N
  pto.pointer_cast i64 addr operands into BaseMemInfo.baseAddresses
  (constants -> bit offsets, non-constants -> hasVariableAddress=true).
  Subview/alias deltaOffset is applied to every slot, not just slot 0.

- MemoryDependentAnalyzer gains getMultiBufferSlotCount(a, b) implementing
  HIVM's per-slot geometry check: same-index slots must overlap (real
  back-edge dep on the same physical buffer) and different-index slots must
  NOT overlap (so consecutive iterations land in disjoint physical buffers).

- InsertSyncAnalysis::GetEventIdNum is rewritten to follow HIVM semantics:
  every dependent pair must be multi-buffer-eligible, all pairs must agree
  on N, and every involved buffer must hang off the same scf.for.

Verified end-to-end: a `pto.multi_buffer = 2` alloc inside scf.for now
emits 2 reserved event ids, an `iv mod 2` arith.select chain, and
`pto.{set,wait}_flag_dyn` ops driven by the selected idx. New regression
guards this in test/lit/pto/multi_buffer_insert_sync_dyn_event_id.pto.
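
The per-slot geometry rule above can be sketched in plain C++. This is a minimal illustration only: `Slot`, `overlaps`, and this free-function form of `getMultiBufferSlotCount` are assumptions, not the actual MemoryDependentAnalyzer interface, which works on `BaseMemInfo.baseAddresses`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of the per-slot geometry rule: buffers a and b are
// multi-buffer-eligible with N slots only when same-index slots overlap
// (a real back-edge dep on the same physical buffer) and different-index
// slots are disjoint (consecutive iterations land in different buffers).
struct Slot { int64_t base; int64_t size; };

static bool overlaps(const Slot &x, const Slot &y) {
  return x.base < y.base + y.size && y.base < x.base + x.size;
}

// Returns N when the rule holds for every slot pair, 1 (single-buffer)
// otherwise.
int getMultiBufferSlotCount(const std::vector<Slot> &a,
                            const std::vector<Slot> &b) {
  if (a.size() != b.size() || a.size() < 2)
    return 1;
  for (size_t i = 0; i < a.size(); ++i)
    for (size_t j = 0; j < b.size(); ++j) {
      bool ov = overlaps(a[i], b[j]);
      if (i == j && !ov) return 1;  // same index must alias
      if (i != j && ov) return 1;   // different index must be disjoint
    }
  return static_cast<int>(a.size());
}
```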

Co-Authored-By: Claude <noreply@anthropic.com>
Builds on P0 to make the multi-buffer pipeline robust under realistic
pressure:

- SyncEventIdAllocation: HIVM-style TryFallbackMultiBuffer. When the
  event-id pool can't satisfy N, fall back to N=2 first (when N is even
  and >2) then to 1, instead of silently returning with an empty
  eventIds vec. Both set/wait siblings get their `eventIdNum` updated
  in lockstep so SyncCodegen takes the same code path on each side.

- PTOPlanMemory: add `StorageEntry::isMultiBufferSlot` flag and rewrite
  `UpdateBuffer2Offsets` to walk via primaries and emit slots in
  declared `relationOtherBuffers` order. This makes the slot-ordering
  contract that EnableMultiBuffer relies on (`offsets[i]` <-> iv mod N
  selector index `i`) explicit and verifies it via a runtime assertion.

- PTOEnableMultiBuffer: add scope guard (skip non-VEC/MAT casts where
  the iv-mod-N selector is not meaningful) and loop-invariance guard
  (skip casts whose addrs are not loop-invariant - hoisting them above
  the for loop would break SSA dominance).

- New regression at test/lit/pto/multi_buffer_n4_insert_sync.pto: N=4
  slot ordering [0, 1024, 2048, 3072] in pto.pointer_cast, 3-way
  arith.select chain, and 4 distinct event ids drained at function end.
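
The event-id fallback ladder described in the first bullet can be sketched as follows. The name `tryFallbackMultiBuffer` and this free-function shape are assumptions for illustration, not SyncEventIdAllocation's real signature.

```cpp
#include <cassert>

// Minimal sketch of the fallback ladder: when the event-id pool cannot
// supply N ids, degrade N -> 2 first (only when N is even and > 2),
// then N -> 1, instead of returning with an empty eventIds vector.
int tryFallbackMultiBuffer(int n, int freeEventIds) {
  while (n > freeEventIds && n > 1) {
    if (n > 2 && n % 2 == 0)
      n = 2;  // smallest shape that still double-buffers
    else
      n = 1;  // single-buffer: plain set/wait flag pair
  }
  return n;
}
```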

Co-Authored-By: Claude <noreply@anthropic.com>
Final hardening pass tightening multi-buffer correctness around the edges
PR-615 and earlier P0/P1 commits left ambiguous.

- InsertSyncAnalysis: GetEventIdNum now takes the back-edge scf.for and
  requires every involved buffer's enclosing loop to *equal* that loop.
  Previously a multi-buffer alloc nested inside an inner loop could
  silently rotate slots on the wrong induction variable when a back-edge
  on an outer loop carried the dep. Forward deps unconditionally collapse
  to single-buffer.

- PTOEnableMultiBuffer: cache `iv mod N` per (loop, N) so several
  multi-buffer pointer_casts sharing a loop reuse one counter, mirroring
  SyncCodegen::loop2BufferCounter. Avoids redundant arith.remui +
  constant ops in the loop body.

- PTOPlanMemory:
  * `MemPlan::emitMultiBufferError` + `multiBufferDiagnosticEmitted_`
    flag converts multi-buffer-specific `report_fatal_error` calls
    (slot-order mismatch, pong outline placement failure) into
    diagnostics that bubble through `plan()` as `failure()`. This lets
    the existing PlanMemoryPass retry loop re-seed instead of aborting
    the compiler under heavy multi-buffer memory pressure.
  * `VerifyConflictStage1` now enumerates *every* historical
    multi-buffer slot offset (via `CollectMultiRelationPongAnchors`)
    as a SPEC_LEVEL_1 reuse anchor candidate. PR-615 only used the last
    slot, silently dropping reuse for N > 2.

- New regression at test/lit/pto/multi_buffer_nested_loop.pto: nested
  scf.for with multi-buffer alloc in the inner loop must rotate slots on
  the inner iv, not the outer one.

Co-Authored-By: Claude <noreply@anthropic.com>
`VerifyConflictStage2`'s contract (per its own comment) is "block reuse
only when buffers (a) share a parent loop AND (b) have a DMA pipe
conflict". `PipeConflictInSameLoop` violated that on two counts:

1. After confirming same parent loop, it returned true unconditionally
   instead of querying `PipeConflict` for the actual DMA pipe-conflict
   relation. SPEC_LEVEL_2 was therefore stricter than SPEC_LEVEL_3
   semantically: any same-loop pair was rejected, even when no DMA
   pipe conflict existed.

2. `GetBufferParentLoop` returns nullptr for top-level buffers; two
   top-level buffers compared parentLoop1 == parentLoop2 == nullptr,
   so the "different loops -> allow reuse" early-return was bypassed
   and the function fell through to "return true". Almost every
   cross-buffer pair at function scope was getting marked as
   conflicting, blocking valid reuse and causing local-memory
   planning to fail/overflow in cases that previously fit.

Fix: reject the nullptr case explicitly, then fall through to the
real `PipeConflict` query before declaring a conflict.

166/166 lit tests still pass; the change can only widen the set of
allowed reuses, never narrow it.
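
The fixed predicate can be summarized in a standalone sketch. `Buffer` and `Loop` here are stand-ins for PlanMemory's real buffer records, and `pipeConflict` is a placeholder for the actual DMA pipe-conflict query.

```cpp
#include <cassert>

// Stand-in types; the real code operates on PlanMemory's buffer records.
struct Loop {};
struct Buffer { Loop *parentLoop; int pipe; };

// Placeholder for the real DMA pipe-conflict relation.
static bool pipeConflict(const Buffer &a, const Buffer &b) {
  return a.pipe != b.pipe;
}

// SPEC_LEVEL_2: block reuse only when the buffers share a *non-null*
// parent loop AND actually conflict on a DMA pipe.
bool pipeConflictInSameLoop(const Buffer &a, const Buffer &b) {
  if (a.parentLoop != b.parentLoop)
    return false;              // different loops -> reuse allowed
  if (!a.parentLoop)
    return false;              // both top-level: nullptr is not "same loop"
  return pipeConflict(a, b);   // only a real pipe conflict blocks reuse
}
```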

Co-Authored-By: Claude <noreply@anthropic.com>
After merging upstream/main, multi-buffer-enabled loops emitted spurious
`pto.barrier <PIPE_MTE2>` and `pto.barrier <PIPE_MTE3>` ops in addition
to the multi-buffer `set_flag_dyn` / `wait_flag_dyn` pair, e.g.:

  pto.wait_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]
  pto.barrier <PIPE_MTE2>            <- redundant
  pto.tload                           ; (MTE2)
  pto.set_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
  pto.wait_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
  pto.barrier <PIPE_MTE3>            <- redundant
  pto.tstore                          ; (MTE3)
  pto.set_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]

Both barriers came from `InsertSyncOperation`'s same-pipe back-edge
branch unconditionally emitting `PIPE_BARRIER` for any cross-iteration
dep on a single pipe. Two of these were truly redundant:

1. Same-pipe back-edge whose dep is multi-buffer eligible. The whole
   point of multi-buffer is that consecutive iterations land in
   disjoint physical slots, so the cross-iter "same-address" dep is
   fundamentally false; the multi-buffer dyn flag pair already does
   the cross-iter ordering. No barrier needed.

2. Same-pipe back-edge on a DMA pipe (MTE1/MTE2/MTE3/MTE4/MTE5). DMA
   pipes are simple in-order command queues; the hardware itself
   serializes consecutive ops on the same DMA pipe across iterations.
   PIPE_BARRIER on a DMA pipe is a no-op and just clutters the IR.

Fix: in `InsertSyncOperation`, compute the multi-buffer eventIdNum once
(also resolves the back-edge scf.for so the cross-pipe path can reuse
it - small refactor of the existing logic) and skip the barrier when
either condition above holds. PIPE_M / PIPE_V keep the conservative
PIPE_BARRIER to preserve the "bar_m" / "bar_v" intra-pipe semantics
that higher-level frontends rely on (the existing comment in
`IsNoNeedToInsertSync` calls this out).
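
The two skip conditions boil down to a small predicate. The sketch below uses an assumed pipe enumeration and helper names for illustration; it is not the actual InsertSyncOperation code.

```cpp
#include <cassert>

// Assumed pipe enumeration; ordered so the MTE range check below works.
enum Pipe { PIPE_M, PIPE_V, PIPE_MTE1, PIPE_MTE2, PIPE_MTE3, PIPE_MTE4, PIPE_MTE5 };

// DMA pipes are in-order command queues: the hardware already serializes
// consecutive ops on the same DMA pipe across iterations.
static bool isDmaPipe(Pipe p) { return p >= PIPE_MTE1 && p <= PIPE_MTE5; }

// Same-pipe back-edge dep: emit PIPE_BARRIER only when neither
// redundancy condition applies.
bool needsSamePipeBarrier(Pipe pipe, int eventIdNum) {
  if (eventIdNum > 1)
    return false;  // multi-buffer dyn flag pair already orders cross-iter
  if (isDmaPipe(pipe))
    return false;  // in-order DMA queue: barrier is a no-op
  return true;     // PIPE_M / PIPE_V keep the conservative barrier
}
```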

168/168 lit pass, including all multi-buffer regressions and the
upstream-merged comm/sync tests.

Co-Authored-By: Claude <noreply@anthropic.com>
…ehmann-807dab

# Conflicts:
#	lib/PTO/Transforms/CMakeLists.txt
#	tools/ptoas/ptoas.cpp
Wire multi-buffer support into the GraphSyncSolver path so it matches the
behaviour the InsertSync path got in earlier P0/P1/P2 commits. Mirrors the
intra-core subset of `bishengir/lib/Dialect/HIVM/Transforms/GraphSyncSolver`
(`SyncSolver::getEventIdInfo`, `getMultiBufferEventIdInfo`,
`getMultiBufferLoop`, `SyncSolverCodeGen::getMultiBufferSelectOp`).

Detection (Utility.h, SyncSolver.h/.cpp):
- `EventIdInfo` (eventIdNum + multibufferLoop) carried per ConflictPair.
- `getMultiBufferLoop` finds the common scf.for whose iv drives the slot
  rotation; reuses the per-slot overlap helper added in P0
  (MemoryDependentAnalyzer::getMultiBufferSlotCount), so the same
  same-index-overlap / different-index-disjoint geometry rule applies.
- `getMultiBufferEventIdInfo` requires every dependent pair to agree on N
  and share the same scf.for; failure -> single-buffer.
- `getEventIdInfo` is the top-level wrapper with the backward-only gate
  (mirrors HIVM's "only optimise back-edge deps").

Solver wiring (handleSetWaitConflict):
- Threads rwOp1/rwOp2 in so MB info is computable.
- Allocates N event ids via the existing Welsh-Powell coloring node
  (EventIdNode already supports eventIdNum > 1).
- Falls back to N=1, then PIPE_ALL barrier, if N ids cannot be coloured.

Codegen (SyncSolverCodeGen.h/.cpp):
- New `emitMultiBufferSetWait` emits the HIVM/InsertSync output shape:
  pre-loop N pto.set_flag (queue prime), in-body iv-mod-N counter +
  N-way arith.select chain + pto.{wait,set}_flag_dyn, post-loop N
  pto.wait_flag (queue drain). The dyn ops live INSIDE the loop body so
  they share the selector's dominance (GSS's default backward-sync
  anchors are at the loop boundary, which works for single-buffer but
  not for a per-iteration selector).
- `loop2BufferCounter_` cache reuses one `iv mod N` across multiple
  ConflictPairs sharing a (loop, N) tuple.
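
What the emitted N-way arith.select chain computes can be read as the following sketch: given the per-slot event ids, the chain reduces to `eventIds[iv mod N]`. The function name and shape are illustrative, not SyncSolverCodeGen's real API.

```cpp
#include <cassert>
#include <vector>

// Sketch of the N-way selector chain semantics: an iv-mod-N counter
// feeds a cascade of selects, one level per slot, whose net effect is
// indexing eventIds by the counter.
int selectEventId(const std::vector<int> &eventIds, int iv) {
  int idx = iv % static_cast<int>(eventIds.size());  // iv-mod-N counter
  int result = eventIds.back();                      // tail of the chain
  for (int i = static_cast<int>(eventIds.size()) - 2; i >= 0; --i)
    result = (idx == i) ? eventIds[i] : result;      // one select per level
  return result;
}
```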

Test:
- New `multi_buffer_gss_dyn_event_id.pto` is the GSS counterpart of the
  existing InsertSync N=2 regression; checks pre-loop primes, the
  selector chain, dyn set/wait, and post-loop drains.

191/191 lit pass.

Co-Authored-By: Claude <noreply@anthropic.com>
@zhangstevenunity changed the title from "Addcv preload scope marki" to "Add auto cv pipeline mulit buffer enable and preload support" on May 5, 2026
@zhangstevenunity zhangstevenunity marked this pull request as ready for review May 5, 2026 07:44
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a CV Preload optimization framework for PTOAS, introducing new IR operations and passes to facilitate multi-buffering and software pipelining between Cube and Vector kernels. The implementation covers automatic multi-buffer marking, scope identification, and stage-guarded loop expansion, alongside significant updates to memory planning and synchronization solvers. Technical feedback highlights several critical areas for refinement: ensuring deterministic IR generation by avoiding DenseMap iteration, permitting pure operations like constants within loop bodies to align with the design specification, and addressing logic gaps in scope identification that currently miss leading or trailing operations. Furthermore, the reviewer recommends strengthening conflict analysis for top-level buffers to mitigate potential hardware hazards.

Comment on lines +288 to +290
SmallVector<LocalBufferBinding> orderedBindings;
orderedBindings.reserve(localBindings.size());
for (const auto &it : localBindings)


medium

The iteration over localBindings (which is a DenseMap) is non-deterministic. This causes the order of cloned PointerCastOp and BindTileOp operations at the beginning of the expanded loop body to vary across different compiler runs. While the resulting code is functionally equivalent, non-deterministic IR generation makes debugging, testing, and IR diffing significantly more difficult.

Consider using a MapVector or maintaining a separate SmallVector to track the insertion order of bindings during the walk.

Comment on lines +318 to +321
if (!scope) {
return op.emitOpError(
"non-CV-scope op cannot be preserved by pto-cv-create-preload yet");
}

medium

This check is overly restrictive. It causes the pass to fail if the loop body contains any operation not wrapped in a pto.cv.scope, including pure operations like arith.constant or index arithmetic. The design document (Section 7.8) explicitly mentions that constants and index arithmetic should be allowed and cloned for each stage.

Restricting the loop body to only CVScopeOp makes the pass fragile and forces users to wrap trivial setup code into scopes unnecessarily.

Comment on lines +290 to +297
if (action.isTPop) {
if (pending.active)
includeMissingThrough(pending, op.getPrevNode());
collectScopeIfValid(pending, core, scopes);
pending.input = action.input;
includeRange(pending, &op, &op);
segmentStart = &op;
continue;

medium

In collectScopesInFor, when the first TPipe operation encountered is a tpop, any operations preceding it in the loop body (e.g., induction variable adjustments or constant definitions) are not included in any scope. Since CVCreatePreloadPass requires all operations in the loop body to be wrapped in scopes, this gap will cause the expansion pass to fail on valid IR.

The logic should ensure that leading operations are either pulled into the first scope or that the identification process covers the entire block range.

Comment on lines +341 to +344
if (pending.active)
includeMissingThrough(pending, terminator ? terminator->getPrevNode()
: nullptr);
collectScopeIfValid(pending, core, scopes);

medium

Similar to the issue with leading operations, if the loop body contains non-TPipe operations after the last identified scope has been closed (i.e., pending.active is false), these trailing operations are missed and left unwrapped. This will lead to expansion failures in the subsequent pto-cv-create-preload pass.

Comment on lines +2565 to +2567
if (!parentLoop1 || !parentLoop2) {
return false;
}

medium

In PipeConflictInSameLoop, returning false when both parentLoop1 and parentLoop2 are nullptr allows top-level buffers (defined at function scope) to reuse the same memory offsets even if they have conflicting pipe usage and overlapping lifetimes.

While SPEC_LEVEL_2 is intended to be less restrictive than SPEC_LEVEL_3, top-level buffers are effectively in the same execution scope. Allowing them to alias while being used by conflicting hardware units (e.g., MTE and Vector) simultaneously could lead to hardware hazards. It would be safer to check PipeConflict if both are nullptr.

@reedhecre

reedhecre commented May 5, 2026

Codex Review

This comment is automatically updated by the review bot.

  • PR: #620 Add auto cv pipeline mulit buffer enable and preload support
  • Author: zhangstevenunity
  • Base/Head: main / codex/cv-preload-scope-marking
  • Head SHA: ef2b6c6a3f71
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-05-06T11:54:25Z
  • Previous Head SHA: aa5ff4602e09
  • Status: completed

Summary

PR #620 introduces two P1 correctness regressions in the new CV multi-buffer/preload pipeline and one P2 error-handling bug in PlanMemory.

Findings

  1. P1 Default pipeline auto-marks CV multi-buffering without also enabling buffer-slot rotation tools/ptoas/ptoas.cpp:1140

PTOCVAutoMarkMultiBufferPass and PTOCVMarkPreloadScopesPass are now always added to the default pipeline, but PTOEnableMultiBufferPass still only runs when --enable-multi-buffer-lowering or --enable-cv-create-preload is set. On a plain --enable-insert-sync / --enable-graph-sync-solver build, PlanMemory can therefore emit variadic pto.pointer_casts, the sync analyzers will use all N planned addresses to choose eventIdNum > 1, and SyncCodegen will emit wait_flag_dyn / set_flag_dyn rotation. EmitC later lowers the same pto.pointer_cast with adaptor.getAddrs()[0] only, so the runtime rotates event IDs but always reuses slot 0. That is a real miscompile for existing sync-only command lines on CV kernels, and it also changes memory planning even when the user never opted into multi-buffer/preload.

  2. P1 Create-preload leaves hoisted bind_tile-backed multi-buffer casts unlowered lib/PTO/Transforms/PTOEnableMultiBuffer.cpp:146

PTOCVCreatePreloadPass hoists cloned pointer_cast + bind_tile pairs above the rewritten loop. inferRotationLoopFromUses() only inspects the direct users of the pointer_cast; in the hoisted case that direct user is the hoisted BindTileOp, which is itself outside any scf.for, so the helper returns nullptr and the pass skips lowering entirely. The remaining variadic pto.pointer_cast then reaches EmitC, which again uses address 0 only. That means different preload stages alias the same physical local-buffer slot, breaking the advertised --enable-cv-create-preload path for normal tilebuf/bind_tile based kernels.

  3. P2 PlanMemory ignores multi-buffer offset-generation errors because the failure flag is checked too early lib/PTO/Transforms/PTOPlanMemory.cpp:1528

emitMultiBufferError() is meant to make the new retry loop reject a bad multi-buffer plan, but plan() checks multiBufferDiagnosticEmitted_ before calling UpdateBuffer2Offsets(). The new invariants added in UpdateBuffer2Offsets() (null relation slot, offset count mismatch) therefore set the flag too late: plan() still returns success and the pass continues with a partially built buffer2Offsets map. That defeats the retry/fail-fast logic and can feed invalid addresses into AllocToPointerCast instead of stopping cleanly.

P0 fixes (correctness, untested edges):
* CVCreatePreload: use llvm::MapVector for LocalBufferBinding so the cloned
  per-stage pointer_cast / bind_tile order is deterministic across builds
  (DenseMap iteration order had been leaking pointer-hash into the IR).
* Multi-buffer slot index now computes ((iv - lb) / step) mod N instead of
  iv mod N. The previous form silently dropped slots when gcd(step, N) > 1
  (e.g. step=2/N=4 only ever selected {0,2}). Fixed in three codegen paths
  (PTOEnableMultiBuffer, GSS SyncSolverCodeGen, InsertSync SyncCodegen);
  step=1, lb=0 still takes the single-remui fast path.
* GraphSyncSolver multi-buffer codegen now preserves the original sync
  anchors. cp->op1 / cp->op2 are syncIR-ordered, not MLIR-lexical-ordered;
  resolve via Operation::isBeforeInBlock so wait_flag_dyn lands before the
  iteration's first user and set_flag_dyn after the iteration's last user.
  Anchors in different blocks (e.g. nested scf.if) fall back to the prior
  body-start / before-terminator placement.
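
The step!=1 slot-index fix can be checked with a small worked example (standalone arithmetic, not the codegen itself): with lb=0, step=2, N=4, `iv mod N` only ever selects slots {0, 2}, while the normalized index visits all four.

```cpp
#include <cassert>

// Old form: drops slots whenever gcd(step, N) > 1.
int slotNaive(int iv, int n) { return iv % n; }

// Fixed form: normalize the induction variable to a trip count first,
// then rotate: ((iv - lb) / step) mod N.
int slotFixed(int iv, int lb, int step, int n) {
  return ((iv - lb) / step) % n;
}
```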

P1 fixes (semantic gaps):
* SyncSolver::getMultiBufferEventIdInfo now matches HIVM and PTO InsertSync:
  any dep pair with n<2 collapses the whole pair to single-buffer rather
  than being silently skipped.
* CVMarkPreloadScopes: ScopeClass key now includes parent scf.for so two
  unrelated loops sharing the same (core, input, output) pipe stay in
  separate classes; fan-out / fan-in along a logical pipe is detected and
  emits a diagnostic instead of silently dropping siblings.
* CVMarkPreloadScopes: TPipe ops nested under scf.if / scf.for inside the
  scope-recognition for now emit a warning and skip auto-marking, instead
  of being missed by the direct-children walk.
* CVCreatePreload: stage-rotated multi-address pointer_casts are hoisted
  above the rotation loop instead of being duplicated max*N times in the
  body. PTOEnableMultiBuffer learns to infer the rotation loop from a
  loop-invariant cast's users so the lowering still finds it.
* CVMarkPreloadScopes::canWrapNoResultScope walks every op (including ops
  inside nested regions of the moved range) when checking for SSA escapes,
  not just the top-level moved ops.

Tests:
* New regression test/lit/pto/multi_buffer_step_not_one.pto covers the
  step!=1 slot-index path.
* Existing cv_create_preload_scope_pipeline.pto updated for the hoisted
  stage pointer_cast layout.
* Full lit suite (194 tests) passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
